Abstract. Phylogenetic trees can be reconstructed from the matrix which contains
the distances between all pairs of languages in a family. Recently, we proposed a
new method which uses normalized Levenshtein distances among words with the same
meaning and averages over all the items of a given list. Decisions about the number of
items in the input lists for language comparison have been debated since the beginning
of glottochronology. The point is that the words associated with some of the meanings
undergo rapid lexical evolution. Therefore, a large vocabulary comparison is only apparently
more accurate than a smaller one, since many of the words do not carry any useful
information. In principle, one should find the optimal length of the input lists by studying
the stability of the different items. In this paper we tackle the problem with an
automated methodology based only on our normalized Levenshtein distance. With
this approach, the program of an automated reconstruction of language relationships
is completed.
1. Introduction

Glottochronology tries to estimate the time at which languages diverged, with the
implicit assumption that vocabularies change at a constant average rate. The concept
seems to have its roots in the work of the French explorer Dumont D'Urville. He
collected comparative word lists of various languages during his voyages aboard the
Astrolabe from 1826 to 1829 and, in his work about the geographical division of the
Pacific P|, he introduced the concept of lexical cognates and proposed a method to
measure the degree of relation among languages. He used a core vocabulary of 115 basic
terms which, impressively, contains all but three of the terms of the Swadesh 100-item list.
Then he assigned a distance from 0 to 1 to any pair of words with the same meaning and
finally he was able to resolve the relationship for any pair of languages. His conclusion
is famous: La langue est partout la même (the language is everywhere the same).

The method used by modern glottochronology, developed by Morris Swadesh [17] in 
the 1950s, measures distances from the percentage of shared cognates. Recent examples 
are the studies of Gray and Atkinson [5] and Gray and Jordan [6]. Cognates are words 
inferred to have a common historical origin, and cognacy decisions are made by trained 
and experienced linguists. Nevertheless, the task of counting the number of cognate
words in a list is far from trivial, and results may vary between studies.
Furthermore, these decisions may require an enormous amount of working time.

Recently, we proposed a new automated method [15] [13] which has several advantages:
first, it avoids subjectivity; second, results can be replicated by other
scholars, assuming that the database is the same; third, no specific linguistic
knowledge is required; and last, but surely not least, it allows for
rapid comparison of a very large number of languages. We applied our method to the
Indo-European and the Austronesian groups considering, in both cases, fifty different
languages.

In our work, we defined the distance between two languages by considering a normalized
Levenshtein distance among words with the same meaning, and we averaged over the two
hundred words contained in a 200-word list [19]. The normalization, which takes into
account word length, plays a crucial role, and no sensible results would have been found
without it.

Almost at the same time, the automated method described above was used and
developed by another large group of scholars [HE]. In their work, they used lists of 40
words while we used lists of 200. Their choice was made according to a careful study of
the stability of different words.

Decisions about the number of words in the input lists for language comparison
have been debated since the beginning of glottochronology; Swadesh himself switched from
200-word lists to 100-word ones. The point is that a large vocabulary comparison is
only apparently more accurate. On the contrary, many of the words do not carry any
information on language similarity, and their inclusion in the lists only
increases the noise that may hide the wanted results. In fact, words evolve
because of lexical changes, borrowings and replacement at a rate which is not the same
for all of them. The speed of lexical evolution is different for different meanings and it
is probably related to the frequency of use of the associated words [12]. Those meanings
with a high rate of change turn out to be useless for establishing relationships among languages.
Furthermore, the study of word stability is of interest in itself, since it may give strong
information on the activities which are at the core of the behavior of a social or ethnic
group.

The idea of inferring the stability of an item from its similarity in related languages
goes back a long way in the lexicostatistical literature [8] [HI [18]. In this paper we
tackle this problem with an automated methodology based on normalized Levenshtein
distance. For any meaning and any linguistic group, we are able to find a number
which measures its stability (equivalently, how slowly it evolves) in a completely objective and
reproducible manner. With this approach, the program of an automated reconstruction
of language relationships is completed. This is different from the approach in [H [9],
which is a combined one: their lists are chosen according to a stability
study which makes use of cognates, and then they reconstruct the language phylogeny
by using Levenshtein distance.

In the next section we define the lexical distance between words and we also sketch
our method for computing the divergence time between languages. Section 3 is the
core of the paper: there we define the automated stability measure of the meanings
and we make a preliminary study of the distribution and ranking of stability
for Indo-European languages. In Section 4 we study correlations and Robinson-Foulds
differences associated with lists of different lengths, and we decide which
meanings should be included in the lists. Conclusions and outlook are in Section 5.

2. Definition of distance 

We define here the lexical distance between two words which is a variant of the 
Levenshtein (or edit) distance. The Levenshtein distance is simply the minimum number 
of insertions, deletions, or substitutions of a single character needed to transform one 
word into the other. Our definition is taken as the edit distance divided by the number 
of characters in the longer of the two compared words. 

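More precisely, given two words $\alpha_i$ and $\beta_j$, their distance $D(\alpha_i, \beta_j)$ is given by

$$D(\alpha_i, \beta_j) = \frac{D_l(\alpha_i, \beta_j)}{L(\alpha_i, \beta_j)} \qquad (1)$$

where $D_l(\alpha_i, \beta_j)$ is the Levenshtein distance between the two words and $L(\alpha_i, \beta_j)$ is the
number of characters of the longer of the two words $\alpha_i$ and $\beta_j$. Therefore, the distance
can take any value between 0 and 1. Obviously $D(\alpha_i, \alpha_i) = 0$.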
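
As an illustration, the normalized distance defined above can be sketched in Python as follows. This is a minimal sketch and not the program used for the paper: the function names `levenshtein` and `normalized_distance` are hypothetical, words are treated as plain character strings, and assigning distance 0 to two empty strings is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter word in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer word."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # assumption: two empty strings are identical
    return levenshtein(a, b) / longest
```

For instance, the forms night and nacht differ by two substitutions over five characters, so their normalized distance under this definition is 2/5.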

The normalization is an important novelty and it plays a crucial role; no sensible
results can be found without it [15] [13].

We use the distance between pairs of words, as defined above, to construct the lexical
distances between languages. For any pair of languages, the first step is to compute the
distance between words corresponding to the same meaning in the Swadesh list. Then
the lexical distance between each pair of languages is defined as the average of the distances
between all such word pairs [15] [13]. The result is a number between 0 and 1 which we
take to be the lexical distance between the two languages.

Assume that the number of languages is N and the list of words for any language
contains M items. Any language in the group is labeled by a Greek letter (say $\alpha$) and
any word of that language by $\alpha_i$ with $1 \le i \le M$. Then two words $\alpha_i$ and $\beta_j$ in the
languages $\alpha$ and $\beta$ correspond to the same meaning if
$i = j$.

Then the distance between two languages is

$$D(\alpha, \beta) = \frac{1}{M} \sum_{i} D(\alpha_i, \beta_i) \qquad (2)$$

where the sum goes from 1 to M. Notice that only pairs of words with the same meaning
are used in this definition. This number is in the interval [0,1]; obviously $D(\alpha, \alpha) = 0$.

The result of the analysis is an N x N upper triangular matrix whose entries are
the N(N - 1)/2 non-trivial lexical distances $D(\alpha, \beta)$ between all pairs in a group. Indeed,
our method for computing distances is a very simple operation that does not need any
specific linguistic knowledge and requires a minimum of computing time.
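
The averaging over meanings and the construction of the pairwise distances can be sketched as below. This is an illustrative sketch under stated assumptions: all function names are hypothetical, the word lists are toy examples in plain orthography (the paper's database uses its own transcription and 200 meanings), and the compact recursive edit distance is only meant for short words.

```python
from functools import lru_cache
from itertools import combinations


@lru_cache(maxsize=None)
def levenshtein(a: str, b: str) -> int:
    """Plain recursive edit distance; adequate for short words."""
    if not a or not b:
        return len(a) + len(b)
    return min(
        levenshtein(a[1:], b) + 1,                   # delete a[0]
        levenshtein(a, b[1:]) + 1,                   # insert b[0]
        levenshtein(a[1:], b[1:]) + (a[0] != b[0]),  # match or substitute
    )


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer word."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def language_distance(words_a, words_b) -> float:
    """Average word distance; position i holds the same meaning in both lists."""
    if len(words_a) != len(words_b) or not words_a:
        raise ValueError("lists must be non-empty and aligned by meaning")
    total = sum(normalized_distance(x, y) for x, y in zip(words_a, words_b))
    return total / len(words_a)


def distance_matrix(word_lists):
    """Map each unordered pair of languages to its lexical distance."""
    return {
        (a, b): language_distance(word_lists[a], word_lists[b])
        for a, b in combinations(sorted(word_lists), 2)
    }


# Toy input (illustrative only): the meanings "night", "star" and "salt".
toy_lists = {
    "English": ["night", "star", "salt"],
    "German": ["nacht", "stern", "salz"],
    "Italian": ["notte", "stella", "sale"],
}
toy_matrix = distance_matrix(toy_lists)  # 3 languages give 3 * 2 / 2 = 3 entries
```

With N = 50 languages the same loop would produce 50 * 49 / 2 = 1225 distances, one per unordered pair.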

A phylogenetic tree can be constructed from the matrix of lexical distances $D(\alpha, \beta)$,
but this gives only the topology of the tree, whereas the absolute time scale is missing.
Therefore, we perform [15] [13] a logarithmic transformation of lexical distances which
is the analogue of the adjusted fundamental formula of glottochronology [16]. In this
way we obtain a new N x N upper triangular matrix whose entries are the divergence
times between all pairs of languages. This matrix preserves the topology of the lexical
distance matrix but it also contains the information concerning absolute time scales.
Then the phylogenetic tree can be straightforwardly constructed.

In [15] [13] we tested our method by constructing the phylogenetic trees of the
Indo-European group and of the Austronesian group. In both cases we considered N = 50
languages. The database [19] that we used in [15] [13] is composed of M = 200 words
for each language. The main source for the database for the Indo-European group is
the file prepared by Dyen et al. in [I]. For the Austronesian group we used as the main
source the lists contained in the huge database in [7].

Our proposal has been criticized [10] on the grounds that our
reconstructed tree presents some incongruences, for example the early separation of
Armenian, which is not grouped together with Greek (which, in our tree, separates just
after Armenian). Nevertheless, the structure of the top of the Indo-European tree is
debated and no universally accepted conclusion exists.

In our previous work we adopted the historically motivated choice of 200-word
lists with the meanings proposed by Swadesh. Our aim in this paper is to establish in
an objective manner the proper length and composition of the lists. In order to reach
this goal we need to study the stability of each meaning separately.
Figure 1. Stability histogram of meanings for Indo-European languages. The fat tail
on the right of the histogram indicates that some items have a very large stability.



3. Stability of meanings 

We now address the stability of meanings. Our aim is to obtain an
automated procedure which avoids, also at this level, the use of cognates. For this
purpose, it is necessary to obtain a measure of the typical distance between all pairs of words
corresponding to a given meaning in a language family. The meaning is indicated by
the label i and $\alpha_i$ is the corresponding word in the language $\alpha$. Therefore, we define the
stability as:

$$S(i) = 1 - \frac{2}{N(N-1)} \sum_{\alpha > \beta} D(\alpha_i, \beta_i) \qquad (3)$$

where the sum runs over all N(N - 1)/2 possible language pairs $\alpha, \beta$ in the family,
using the fact that $D(\alpha_i, \beta_i) = D(\beta_i, \alpha_i)$.

With this definition, S(i) equals one minus the average of the distances
$D(\alpha_i, \beta_i)$ and takes a value between 0 and 1. The average distance is smaller for those
words corresponding to meanings with a lower rate of lexical evolution, since they tend to
remain more similar in two languages. Therefore, a larger S(i) corresponds to a greater
stability.
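
The stability measure described above (one minus the average normalized distance over all language pairs, computed separately for each meaning) could be implemented along these lines. The sketch is only illustrative: the names `stability` and `toy_lists` are hypothetical, the word lists are toy data in plain orthography, and a compact recursive edit distance stands in for whatever implementation is actually used.

```python
from functools import lru_cache
from itertools import combinations


@lru_cache(maxsize=None)
def levenshtein(a: str, b: str) -> int:
    """Plain recursive edit distance; adequate for short words."""
    if not a or not b:
        return len(a) + len(b)
    return min(
        levenshtein(a[1:], b) + 1,
        levenshtein(a, b[1:]) + 1,
        levenshtein(a[1:], b[1:]) + (a[0] != b[0]),
    )


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer word."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def stability(word_lists):
    """Return S(i) for every meaning i: 1 minus the mean distance over language pairs."""
    languages = sorted(word_lists)
    if len(languages) < 2:
        raise ValueError("at least two languages are needed")
    pairs = list(combinations(languages, 2))  # N(N-1)/2 unordered pairs
    n_meanings = len(word_lists[languages[0]])
    scores = []
    for i in range(n_meanings):
        mean_distance = sum(
            normalized_distance(word_lists[a][i], word_lists[b][i]) for a, b in pairs
        ) / len(pairs)
        scores.append(1.0 - mean_distance)
    return scores


# Toy input (illustrative only): the meanings "night", "star" and "salt".
toy_lists = {
    "English": ["night", "star", "salt"],
    "German": ["nacht", "stern", "salz"],
    "Italian": ["notte", "stella", "sale"],
}
toy_stability = stability(toy_lists)  # one value in [0, 1] per meaning
```

A meaning whose words are identical in every language gets stability 1, while a meaning whose words share no aligned character in any pair gets stability 0.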

We computed S(i) for the 200 meanings of 50 languages of the Indo-European
family. To gain a first qualitative understanding of the distribution of S(i), we plot
the associated histogram in Fig. 1. There is a fat tail on the right
of the histogram, indicating that there are some meanings with quite large stability.
This tail is strongly at variance with standard Gaussian behavior. The same results are
obtained if we consider the Austronesian family instead. We remark that similar plots
were computed in [12], where the rates of lexical evolution are obtained by the standard
glottochronology approach.
Figure 2. Stability in decreasing rank for the 200 meanings of the Indo-European
languages. At the beginning stability has large values but drops rapidly; then, between
the 50th position and the 180th, it decreases linearly; finally it drops again. The straight
line between position 51 and position 180 underlines the initial and final deviations from
the linear behavior.



To better understand the behavior of the stability distribution, we plotted S(i)
in decreasing rank for the 200 meanings in the list. The data for the
Indo-European family are reported in Fig. 2. At the beginning the stability drops rapidly; then,
between the 50th position and the 180th, it decreases slowly and almost linearly with
rank; finally, at the end, the stability drops again. We stress again that this behavior
is not Gaussian, for which the high and low stability parts of the curve would be symmetric.
The curve is fitted by a straight line in the central part of the data, between position
51 and position 180, in order to highlight the initial and final deviations from the linear
behavior. We remark that the qualitative behavior for the Austronesian family is exactly
the same.
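
The ranking and the straight-line fit over positions 51 to 180 can be reproduced with an ordinary least-squares fit, as sketched here. The function name `rank_and_fit` is hypothetical, the choice of ordinary least squares is an assumption (the text only says the curve is fitted by a straight line), and the input used in the example is synthetic rather than the paper's data.

```python
def rank_and_fit(stabilities, first=51, last=180):
    """Sort stabilities in decreasing order and fit a line to ranks first..last (1-based)."""
    ranked = sorted(stabilities, reverse=True)
    if not 1 <= first < last <= len(ranked):
        raise ValueError("fit window must lie inside the ranked list")
    ranks = range(first, last + 1)
    values = ranked[first - 1:last]
    mean_r = sum(ranks) / len(ranks)
    mean_v = sum(values) / len(values)
    # Ordinary least-squares slope and intercept for value = intercept + slope * rank.
    slope = (
        sum((r - mean_r) * (v - mean_v) for r, v in zip(ranks, values))
        / sum((r - mean_r) ** 2 for r in ranks)
    )
    intercept = mean_v - slope * mean_r
    # Deviation of every ranked value from the fitted line, including the two ends.
    residuals = [v - (intercept + slope * r) for r, v in enumerate(ranked, start=1)]
    return ranked, slope, intercept, residuals


# Synthetic stabilities (not the paper's data): a straight line with a bump at the
# top and a dip at the bottom, mimicking the qualitative shape described in the text.
synthetic = [0.30 - 0.001 * r for r in range(1, 201)]
synthetic[:10] = [s + 0.10 for s in synthetic[:10]]
synthetic[-10:] = [s - 0.05 for s in synthetic[-10:]]
ranked, slope, intercept, residuals = rank_and_fit(synthetic)
```

Positive residuals at the first ranks and negative residuals at the last ranks correspond to the initial and final deviations from linear behavior.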

A preliminary conclusion is that one should surely keep all the meanings with higher
information, take at least some of the most stable meanings in the linear part of the
curve and exclude completely those meanings with lower information. Nevertheless, at
this stage it is difficult to say how many items should be retained, since this number
could be anywhere between 50 and 180.

A deeper analysis of the stability is necessary to reach a conclusion. Indeed,
we need to know the minimum number of meanings which allows for a precise
computation of distances between languages and, consequently, permits an accurate 
construction of the phylogenetic tree. In order to reach this goal we need a careful 
analysis of correlations among distances computed with the whole list and distances 
computed with shorter lists. It is also necessary to compare the phylogenetic trees by a 
proper measure, the most natural being the Robinson-Foulds difference. 
Figure 3. Correlation coefficient c(n) of distances for Indo-European languages. The
coefficient c(n) measures the correlation between the distances estimated with n items
and the distances estimated with 200 items. c(n) reaches a value larger than 99% at
n = 100.

4. Correlations 

As mentioned in the previous section, we first need to evaluate the impact of
shorter lists on our estimate of the distances between languages. To this end,
we compute the correlation coefficient c(n) between the distances $D(\alpha, \beta)$ obtained
from the whole list of 200 items and the distances $D_n(\alpha, \beta)$ obtained from only the n most
stable items (obviously, $D(\alpha, \beta) = D_{200}(\alpha, \beta)$).

The correlation coefficient c(n) is computed in the standard way, using averages over
all possible pairs of languages, and it takes the value 1 only when there is complete
coincidence between $D_n(\alpha, \beta)$ and $D(\alpha, \beta)$. The correlation is plotted in Fig. 3 for the
Indo-European family. Also in this case, similar results are found if the Austronesian
family is considered.
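
A possible way to compute c(n) is sketched below. It assumes that the "standard way" means the Pearson correlation coefficient and that the per-meaning word distances for every language pair have already been computed; the names `pearson`, `correlation_with_full_list` and `toy` are hypothetical, and the numbers are synthetic.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def correlation_with_full_list(per_meaning, n):
    """c(n): correlation between distances from the n most stable meanings and from all.

    per_meaning maps each language pair to its list of word distances, one per meaning.
    """
    pairs = list(per_meaning)
    n_meanings = len(per_meaning[pairs[0]])
    # Stability of meaning i: 1 minus the mean word distance over all pairs.
    stability = [
        1.0 - sum(per_meaning[p][i] for p in pairs) / len(pairs)
        for i in range(n_meanings)
    ]
    top = sorted(range(n_meanings), key=lambda i: stability[i], reverse=True)[:n]
    full = [sum(per_meaning[p]) / n_meanings for p in pairs]        # D
    short = [sum(per_meaning[p][i] for i in top) / n for p in pairs]  # D_n
    return pearson(short, full)


# Synthetic word distances (not the paper's data): 3 language pairs, 4 meanings.
toy = {
    ("A", "B"): [0.1, 0.5, 0.9, 0.8],
    ("A", "C"): [0.2, 0.7, 1.0, 0.6],
    ("B", "C"): [0.0, 0.4, 0.8, 1.0],
}
c_two = correlation_with_full_list(toy, 2)
c_all = correlation_with_full_list(toy, 4)  # full list against itself
```

When n equals the full list length the two sets of distances coincide, so the coefficient is 1 up to rounding.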

From the figure one can observe that the correlation reaches a value larger than 
99% with 100 meanings. 

The problem is, again, that our choice for the length of the lists depends on our
choice of the minimum accepted correlation coefficient. If we accept 97% we are satisfied
with lists of 50 meanings, while if we need 99% we have to take lists of 100 meanings.

To resolve this problem we estimated the Robinson-Foulds (RF) difference [14] between
the trees generated starting from $D_n(\alpha, \beta)$ and the tree generated starting from $D(\alpha, \beta)$.
The RF difference, which is plotted in Figure 4 for the Indo-European family, measures
the degree of similarity between two trees. Lower values correspond to trees which are
more similar.
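
To make the comparison of trees concrete, the following sketch counts the groupings that appear in one tree but not in the other. It is a simplified illustration under several assumptions: trees are rooted and written as nested tuples of language names, the difference is taken over clades (sets of leaves below an internal node) rather than over the bipartitions of unrooted trees used in the classical definition, and no normalization is applied. The function names and the two toy topologies are hypothetical and are not results from the paper.

```python
def clades(tree):
    """Return (set of leaves, set of non-trivial clades) of a nested-tuple rooted tree."""
    found = set()

    def visit(node):
        if isinstance(node, str):  # a leaf is a language name
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        found.add(leaves)
        return leaves

    all_leaves = visit(tree)
    found.discard(all_leaves)  # the root clade is shared by all trees on these leaves
    return all_leaves, found


def robinson_foulds(tree_a, tree_b) -> int:
    """Number of clades present in exactly one of the two trees."""
    leaves_a, clades_a = clades(tree_a)
    leaves_b, clades_b = clades(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("both trees must contain the same languages")
    return len(clades_a ^ clades_b)  # symmetric difference


# Toy topologies (illustrative only).
tree_full = ((("Italian", "Spanish"), "French"), ("English", "German"))
tree_short = ((("Italian", "French"), "Spanish"), ("English", "German"))
rf = robinson_foulds(tree_full, tree_short)
```

Two trees with the same groupings give a difference of zero regardless of the order in which the branches are written.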

As one can see from Fig. 4, the RF difference drops rapidly until n ~ 100. Then it
remains almost constant for all values greater than n = 100 (the RF difference is equal
to zero when n = 200, but this is expected since $D_{200}(\alpha, \beta) = D(\alpha, \beta)$). This result says
that with 100 meanings one is able to capture all the information regarding language
distances, and larger lists produce the same output. In other words, the 100 meanings
which have been eliminated carry little, if any, information.

Figure 4. Robinson-Foulds difference between trees of Indo-European languages
computed with lists of 200 items and lists of n items. The RF difference measures
the degree of similarity between trees. More similar trees have a smaller difference.
The RF difference drops rapidly until n ~ 100, then it remains almost constant for all
greater values of n.

The complete list of the 100 most stable terms for the Indo-European group can
be found in Table 1. The list is ordered by rank, and the stability value is given
next to each item.

In conclusion, one has to consider lists with the 100 meanings with the highest stability,
compute the matrix of lexical distances, transform it into the matrix of divergence times
and, finally, construct the tree. The elimination of the 100 items with lower stability
has the positive effect of reducing the working time necessary for an accurate check of all
items and, therefore, of reducing errors due to misspelling or inaccurate transliteration.
Furthermore, shorter lists allow for comparison of languages whose available vocabulary
is small.

5. Discussion and conclusions 

In previous works [15] [13] we proposed an automated method for evaluating the distance
between languages. Here we propose a method that is also automatic and gives lists of
the most stable meanings. The novelty is that, combining [15] [13] with the results
presented here, everything can be done automatically. Stable meanings, distances,
divergence times and phylogenetic trees can all be obtained by using simple objective
arguments based on normalized Levenshtein distance.
Table 1. List of the 100 most stable meanings according to the S(i) measure described
in the text. The ranking reads from left to right, then from top to bottom.

Word      S(i)      Word      S(i)      Word      S(i)      Word      S(i)
you       0.45395   three     0.44102   mother    0.36627   not       0.35033
new       0.31961   nose      0.3169    four      0.30226   night     0.29403
two       0.28214   name      0.27962   tooth     0.27677   star      0.27269
salt      0.26792   day       0.26695   grass     0.26231   sea       0.25906
die       0.25602   sun       0.25535   one       0.23093   feather   0.23055
give      0.22864   sit       0.22757   stand     0.22644   meat      0.2261
long      0.22491   five      0.22353   hand      0.22261   short     0.21676
father    0.21319   smoke     0.21213   far       0.20998   worm      0.20846
dry       0.207     scratch   0.20343   person    0.20129   when      0.20011
wind      0.19535   snake     0.19485   sing      0.19434   stone     0.19369
suck      0.19196   mouth     0.19067   dig       0.19052   live      0.18716
root      0.18715   hair      0.18522   smooth    0.18457   water     0.18378
tongue    0.18194   animal    0.1819    year      0.17892   red       0.17815
man       0.17801   tie       0.17789   snow      0.17697   sew       0.17686
there     0.17657   breathe   0.17578   flower    0.17566   mountain  0.17545
fruit     0.17508   bark      0.17502   sand      0.17443   leaf      0.1739
warm      0.17283   green     0.17269   liver     0.17205   hunt      0.17168
sky       0.17156   know      0.17117   bone      0.17056   spit      0.17036
heart     0.17023   pull      0.16984   right     0.1689    we        0.16858
husband   0.16853   foot      0.1683    drink     0.16828   see       0.16764
lie       0.16763   fish      0.16693   woman     0.16656   louse     0.16624
straight  0.16534   yellow    0.16487   sleep     0.1643    black     0.16408
who       0.16351   seed      0.16299   wing      0.16288   cut       0.16245
count     0.16173   thin      0.16156   sharp     0.1611    float     0.16028
fall      0.15968   earth     0.15965   kill      0.15926   burn      0.15918

We do not claim that our combined method produces better results than the
standard glottochronology approach, but surely comparable ones. The advantages of this
approach can be summarized as follows: it avoids subjectivity, since all results
can be replicated by other scholars assuming that the database is the same; it allows
for rapid comparison of a very large number of languages; and it can also be used for those
language groups for which the use of cognates is very complicated or even impossible.
In fact, the only manual work is to prepare the lists, while all the remaining work is done by a
computer program.

We would like to mention that recently, together with other scholars [2], we
have applied the method described here as a starting point for a deeper analysis of
relationships among languages. The point is that a tree is only an approximation which,
obviously, skips more complex phenomena such as horizontal transfer. These phenomena
are reflected in the matrix of distances as deviations from the ultrametric structure.
It seems that the approach in [2] allows for a more accurate understanding of some
important topics, such as migration patterns and homeland locations of families of
languages.